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Abstract: We present some nonparametric methods for graphical modeling. In the 
discrete case, where the data are binary or drawn from a finite alphabet, Markov 
random fields are already essentially nonparametric, since the cliques can take only a 
finite number of values. Continuous data are different. The Gaussian graphical model 
t-H is the standard parametric model for continuous data, but it makes distributional 

assumptions that are often unrealistic. We discuss two approaches to building more 
flexible graphical models. One allows arbitrary graphs and a nonparametric extension 
of the Gaussian; the other uses kernel density estimation and restricts the graphs to 
trees and forests. Examples of both methods are presented. We also discuss possible 
future research directions for nonparametric graphical modeling. 
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1. Introduction 

This paper presents two methods for constructing nonparametric graphical models for con- 
tinuous data. In the discrete case, where the data are binary or drawn from a finite alphabet, 
Markov random fields or log-linear models are already essentially nonparametric, since the 
cliques can take only a finite number of values. Continuous data are different. The Gaus- 
sian graphical model is the standard parametric model for continuous data, but it makes 
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distributional assumptions that are typically unrealistic. Yet few practical alternatives to 
the Gaussian graphical model exist, particularly for high dimensional data. We discuss two 
approaches to building more flexible graphical models that exploit sparsity. These two ap- 
proaches are at different extremes in the array of choices available. One allows arbitrary 
graphs, but makes a distributional restriction through the use of copulas; this is a semi- 
parametric extension of the Gaussian. The other approach uses kernel density estimation 
and restricts the graphs to trees and forests; in this case the model is fully nonparametric, 
at the expense of structural restrictions. We describe two-step estimation methods for both 
approaches. We also outline some statistical theory for the methods, and compare them in 
some examples. This article is in part a digest of two recent research articles where these 
methods first appeared, Liu et al. (2009) and Liu et al. (2011). 

The methods we present here are relatively simple, and many more possibilities remain for 
nonparametric graphical modeling. But as we hope to demonstrate, a little nonparametricity 
can go a long way. 

2. Two Families of Nonparametric Graphical Models 

The graph of a random vector is a useful way of exploring the underlying distribution. 
If X = (Xi, . . . , Xd) is a random vector with distribution P, then the undirected graph 
G = (V, E) corresponding to P consists of a vertex set V and an edge set E where V has d 
elements, one for each variable X^. The edge between is excluded from E if and only 
if Xi is independent of Xj given the other variables X\^jy = (X s : 1 < s < d, s ^ 
written 



The general form for a (strictly positive) probability density encoded by an undirected 
graph G is 



where the sum is over all cliques, or fully connected subsets of vertices of the graph. In general, 
this is what we mean by a nonparametric graphical model. It is the graphical model analogue 
of the general nonparametric regression model. Model (2.2) has two main ingredients, the 
graph G and the functions {fc}- However, without further assumptions, it is much too 
general to be practical. The main difficulty in working with such a model is the normalizing 
constant Z(f), which cannot, in general, be efficiently computed or approximated. 

In the spirit of nonparametric estimation, we can seek to impose structure on either 
the graph or the functions fc in order to get a flexible and useful family of models. One 
approach parallels the ideas behind sparse additive models for regression. Specifically, we 
replace the random variable X = (X 1; . . . , Xj) by the transformed random variable f(X) = 



XiUXj X\ {ijj} . 



(2.1) 




(2.2) 
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FlG 1 . Comparison of properties of the nonparanormal and forest- structured densities. 



(fi(Xi), . . . , fd(Xd)), and assume that f(X) is multivariate Gaussian. This results in a non- 
parametric extension of the Normal that we call the nonparanormal distribution. The non- 
paranormal depends on the univariate functions {fj}, and a mean jj, and covariance matrix 
E, all of which are to be estimated from data. While the resulting family of distributions 
is much richer than the standard parametric Normal (the paranormal), the independence 
relations among the variables are still encoded in the precision matrix Q = E -1 , as we show 
below. 

The second approach is to force the graphical structure to be a tree or forest, where each 
pair of vertices is connected by at most one path. Thus, we relax the distributional assump- 
tion of normality but we restrict the allowed family of undirected graphs. The complexity of 
the model is then regulated by selecting the edges to include, using cross validation. 

Figure 1 summarizes the tradeoffs made by these two families of models. The nonpara- 
normal can be thought of as an extension of additive models for regression to graphical 
modeling. This requires estimating the univariate marginals; in the copula approach, this is 
done by estimating the functions fj{x) = (ij + aj^~ 1 (Fj(x)), where Fj is the distribution 
function for variable Xj. After estimating each fj, we transform to (assumed) jointly Normal 
via Z = (fi(Xi), . . . , fd(Xd)) and then apply methods for Gaussian graphical models to esti- 
mate the graph. In this approach, the univariate marginals are fully nonparametric, and the 
sparsity of the model is regulated through the inverse covariance matrix, as for the graphical 
lasso, or "glasso" (Banerjee et al., 2008; Friedman et al., 2007) 1 The model is estimated in a 
two-stage procedure; first the functions fj are estimated, and then inverse covariance matrix 
Q is estimated. The high level relationship between linear regression models, Gaussian graph- 
ical models, and their extensions to additive and high dimensional models is summarized in 
Figure 2. 

In the forest graph approach, we restrict the graph to be acyclic, and estimate the bi- 
variate marginals p(xi,Xj) nonparametrically. In light of equation (4.1), this yields the full 



throughout the paper we use the term graphical lasso, or glasso, coined by Friedman et al. (2007) to refer 
to the solution obtained by t\ -regularized log- likelihood under the Gaussian graphical model. This estimator 
goes back at least to Yuan and Lin (2007), and an iterative lasso algorithm for doing the optimization was 
first proposed by Banerjee et al. (2008). In our experiments we use the R packages glasso (Friedman et al., 
2007) and huge to implement this algorithm. 
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FlG 2 . Comparison of regression and graphical models. The nonparanormal extends additive models to the 
graphical model setting. Regularizing the inverse covariance leads to an extension to high dimensions, which 
parallels sparse additive models for regression. 

nonparametric family of graphical models having acyclic graphs. Here again, the estimation 
procedure is two-stage; first the marginals are estimated, and then the graph is estimated. 
Sparsity is regulated through the edges that are included in the forest. 

Clearly these are just two tractable families within the very large space of possible non- 
parametric graphical models specified by equation (2.2). Many interesting research possibil- 
ities remain for novel nonparametric graphical models that make different assumptions; we 
discuss some possibilities in a concluding section. We now discuss details of these two model 
families, beginning with the nonparanormal. 

3. The Nonparanormal 

We say that a random vector X = (Xi, . . . , Xd) T has a nonparanormal distribution and write 

X ~NPN(tM,X,f) 

in case there exist functions {fj}j =1 such that Z = f(X) ~ iV(/i, £), where f(X) = 
(fi(Xi), . . . , fd(Xd)). When the f/s are monotone and different iable, the joint probability 
density function of X is given by 

Px{x) = (2tt)^|S|V2 exp {~l - ^ ^ - II W x i)l 

where the product term is a Jacobian. 

Note that the density in (3.1) is not identifiable — we could scale each function by a 
constant, and scale the diagonal of S in the same way, and not change the density. To make 
the family identifiable we demand that fj preserves marginal means and variances: 

Hi = E(Zj) = E(Xj) and o) = E# = Var (Zj) = Var (Xj) . (3.2) 

These conditions only depend on diag(E) but not the full covariance matrix. 
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FlG 3. Densities of three 2- dimensional nonparanormals. The left plots have component functions of the form 
fa( x ) — sign(x)\x\ a , with ol\ = 0.9, and oti = 0.8. The center plots have component functions of the form 
9a(x) — \_x\ + 1/(1 + exp(— a(x — \_x\ — 1/2))) with a\ = 10 and a,i = 5, where x — \ x\ is the fractional part. 
The right plots have component functions of the form h a (x) = x + sm.(ax)/a, with ax = 5 and ot2 = 10. In 



Now, let Fj(x) denote the marginal distribution function of Xj. Since the component 
fj(Xj) is Gaussian, we have that 



The form of the density in (3.1) implies that the conditional independence graph of the 
nonparanormal is encoded in Q = X -1 , as for the parametric Normal, since the density 
factors with respect to the graph of Q, and therefore obeys the global Markov property of 
the graph. 

In fact, this is true for any choice of identification restrictions; thus, it is not necessary to 
estimate /i or cr to estimate the graph, as the following result shows. 

Lemma 3.1. Define 



each case /i = (0, 0) and E = ( 1 5 '^) ■ 




which implies that 



(3.3) 




(3.4) 



0. 
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Proof. We can rewrite the covariance matrix as 



Ejfe = Cov(Zj, Z k ) = cr i cr fc Cov(/i J (X J ), h k (X k )). 

Hence E = DAD and 

E- 1 = D -1 A -1 D"""' 1 , 

where Z) is the diagonal matrix with diag(-D) = cr. The zero pattern of A -1 is therefore 
identical to the zero pattern of E -1 . □ 

Figure 3 shows three examples of 2-dimensional nonparanormal densities. The component 
functions are taken to be from three different families of monotonic functions — one using 
power transforms, one using logistic transforms, and another using sinusoids: 

f a (x) = sign(x)|x| a 

1 

g a (x) = [x\ + 



1 + cxp {— a(x — [x\ — |)} 



sm(ax) 

h a (x) = x + 



a 



The covariance in each case is E = ( 51 ) and the mean is u — (0,0). It can be seen how 
the concavity and number of modes of the density can change with different nonlinearities. 
Clearly the nonparanormal family is much richer than the Normal family. 

The assumption that f(X) = (fi(Xi), . . . , fd(X d ) is Normal leads to a semiparametric 
model where only one dimensional functions need to be estimated. But the monotonicity of 
the functions fj, which map onto R, enables computational tractability of the nonparanormal. 
For more general functions /, the normalizing constant for the density 

f 1 rp 

p x {x) oc exp <^ -- (f(x) - n) E (f(x)-n) 
cannot be computed in closed form. 



3.1. Connection to Copula 

If Fj is the distribution of Xj, then Uj = Fj(Xj) is uniformly distributed on (0, 1). Let C 
denote the joint distribution function of U = (Ui, . . . , Ud), and let F denote the distribution 
function of X. Then we have that 



F( Xl ,...,x d ) = F{X 1 <x 1 ,...,X d <x d ) (3.5) 

= F(F 1 (X 1 )<F 1 (x 1 ),...,F d (X d )<F d (x d )) (3.6) 

= ^(U 1 <F 1 (x 1 ),...,U d <F d (x i )) (3.7) 

= C{F l {x 1 ),...,F d {x d )). (3.8) 
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This is known as Sklar's theorem (Sklar, 1959), and C is called a copula. If c is the density 
function of C then 



d 



p(xt, ...,x d ) = c(Fi(xi), . . .,F d (x d )) Y[p( x j) ( 3 -9) 

where p(xj) is the marginal density of Xj. For the nonparanormal we have 

F( Xl , ...,x d ) = ^(^(^(an)), . . . , ^-\F d (x d ))) (3.10) 

where is the multivariate Gaussian cdf and $ is the univariate standard Gaussian cdf. 

The Gaussian copula is usually expressed in terms of the correlation matrix, which is 
given by R = diag(cr)~ 1 S diag(cr) -1 . Note that the univariate marginal density for a Normal 
can be written as p(xj) = —(f>(uj) where Uj = ( X j — Hj)/a,j. The multivariate Normal density 
can thus be expressed as 

Pus(^i) • • • , %d) = j exp ( — u T Fr l u\ (3.11) 

* K (27r) d / 2 i#i i/2 nti^ V 2 / 



| ^ex P (-i^-/ ) .)n^l. (3.12 

j=l 3 



Since the distribution Fj of the jth variable satisfies Fj(xj) = — /ij)/<x,) = $(«;), we 

have that (Xj — fij)/o~j — §~ 1 (Fj(Xj)). The Gaussian copula density is thus 

cCFiCan), • • • , F d (x d )) = exp{-i$- 1 (F(x)) T ( J R- 1 - I)^ l {F{x))) 

(3.13) 

where = ($ _1 (Fi(xi)), . . . , §~ l (F d (x d ))). This is seen to be equivalent to (3.1) 

using the chain rule and the identity 

<-™ = *WWY (3 ' 14) 

3. 2. Estimation 

Let XW, . . . ,lW be a sample of size n where X« = (xf } , . . . ,xf) T G R d . We'll design 
a two-step estimation procedure where first the functions fj are estimated, and then the 
inverse covariance matrix Q is estimated, after transforming to approximately Normal. 
In light of (3.4) we define 

h j (x) = *- 1 (F j {x)) 
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where Fj is an estimator of Fj. A natural candidate for Fj is the marginal empirical distri- 
bution function 

1 n 

However, in this case hj(x) blows up at the largest and smallest values of Xj. For the 
high dimensional setting where n is small relative to d, an attractive alternative is to use a 
truncated or Winsorized 2 estimator: 

5 n if Fj(x) < 5 n 

Fj{x) = { Fj{x) if 6 n < Fj(x) <l-6 n (3.15) 
(l-5 n ) HF j {x)>l-6 n , 

where 5 n is a truncation parameter. There is a bias- variance tradeoff in choosing 5 n ; increasing 
5 n increases the bias while it decreases the variance. 

Given this estimate of the distribution of variable Xj, we then estimate the transformation 
function fj by 

fj(x) =J2j + v j h j (x) (3.16) 

where 

and jUj and Oj are the sample mean and standard deviation: 



ft'4£ A f a,ld s ' = \ ;E(^-«! 

8=1 \ i=l 

Now, let S n (f) be the sample covariance matrix of f(X^), . . . , f(X^); that is, 

Sn(f) = ^£ (/(* W ) " **»(/)) {f( Xii) )-»n(f)) T (3-17) 
i=l 
1 n ~ 



n 

i=l 



We then estimate f2 using S n (f). For instance, the maximum likelihood estimator is f2jf LE 
The ^-regularized estimator is 



tt n = argmin jtr (pS n {ffj - log |0| + (3.18) 



2 After Charles P. Winsor, the statistician whom John Tukey credited with his conversion from topology 
to statistics (Mallows, 1990). 
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where A is a regularization parameter, and = J2j=i ^2k=i \^jk\- The estimated graph 

is then E n = {(J, k) : Vt jk ^ 0}. 

Thus, we use a two-step procedure to estimate the graph. 

1. Replace the observations, for each variable, by their respective Normal scores, subject 
to a Winsorized truncation. 

2. Apply the graphical lasso to the transformed data to estimate the undirected graph. 

The first step is non-iterative and computationally efficient. The truncation parameter S n 
is chosen to be 

Sn = , 1/4 L-= (3.19) 
4n i / 4 v7r logn 

and does not need to be tuned. As will be shown in Theorem 3.1, such a choice makes the 
nonparanormal amenable to theoretical analysis. 

3.3. Statistical Properties of S n (f) 

The main technical result is an analysis of the covariance of the Winsorized estimator above. 
In particular, we show that under appropriate conditions, 



max 



Sn{f)jk S n (f)jk 



O 



log d + log 2 n 



p 



n 



1/2 



where S n (f)jk denotes the (J, k) entry of the matrix S n (f). This result allows us to leverage 
the significant body of theory on the graphical lasso (Ravikumar et al., 2009; Rothman et al., 
2008) which we apply in step two. 

Theorem 3.1. Suppose that d — w and let f be the Winsorized estimator defined in (3.16) 

with S n = -== . Define 

4n 1 / 4 v 7r logn 

C(M,0 = -^L(V2M-l) (M + 2) 
\/tt£ V / 



for M . ( > 0. Then for any e > C(M,£)\I —r- and sufficiently large n, we have 

y n L ' 2 

P (max S n {f) jk - S n {f) jk > e] < + + c 3 exp ( — °^ 6 \ , 

\ jk J (ne 2 ) 2 ? n M * 1 \ log d + log nj 

where ci,c 2 ,c 3 ,c 4 are positive constants. 



9 



The proof of this result involves a detailed Gaussian tail analysis, and is given in Liu et al. 
(2009). 

Using Theorem 3.1 and the results of Rothman et al. (2008) it can then be shown that the 
precision matrix is estimated at the following rates in the Frobenius norm and the ^-operator 
norm. 




s + d) log d + log 2 n 



n 



1/2 



' s log d + log 2 n 



n 



1/2 



||^n — ^o||f — Op 

and 

||^n — ^0 1| 2 — O 

where 

8 = Card ({(i, j) G {1, . . . , rf} x {1, . . . , rf} | n (i, j) ± 0, i + j}) 

is the number of nonzero off-diagonal elements of the true precision matrix. 

Using the results of Ravikumar et al. (2009), it can also be shown, under appropriate 
conditions, that the sparsity pattern of the precision matrix is estimated accurately with 
high probability. In particular, the nonparanormal estimator Q n satisfies 

P(g (f2 n ,fi )) > l-o(l) 

where Q(fl n ,Q ) is the event 

jsign (p n (j, k)J = sign (n \j, k)) , Vj, k e {1, . . . , d}j . 

We refer to Liu et al. (2009) for the details of the conditions and proofs. These Op(n~ 1//4 ) 
rates are slower than the Op(n~ l l 2 ) rates obtainable for the graphical lasso. However, in more 
recent work (Liu et al., 2012) we use estimators based on Spearman's rho and Kendall's tau 
statistics to obtain the parametric rate. 



4. Forest Density Estimation 

We now describe a very different, but equally flexible and useful approach. Rather than 
assuming a transformation to normality and an arbitrary undirected graph, we restrict the 
graph to be a tree or forest, but allow arbitrary nonparametric distributions. 

Let p*(x) be a probability density with respect to Lebesgue measure //(•) on R d and let 
X^, . . . , be n independent identically distributed M d - valued data vectors sampled from 
p* (x) where X W = (xf ) , . . . , xf ) . Let X s denote the range of xf and let X = X x x • • • x X d . 
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A graph is a forest if it is acyclic. If F is a <i-node undirected forest with vertex set 
Vf = {1, . . . , d} and edge set Ep C {1, . . . , d} x {1, . . . , d}, the number of edges satisfies 
\Ep\ < d. We say that a probability density function p(x) is supported by a forest F if the 
density can be written as 

m*>= n jg^y n **■>, <«) 

(i,j)eE F FK lJFK 31 kev F 

where each p(xi, Xj) is a bivariate density on Xi x Xj, and each p(xk) is a univariate density 
on A'fc. 

Let Fd be the family of forests with d nodes, and let Vd be the corresponding family of 
densities: 



Fd = ■yP > : J p(x) dfi(x) = 1, and p(x) satisfies (4.1) for some F G Fdj ■ (4.2) 
Define the oracle forest density 

q* = argminD(p*|| q) (4.3) 
where the Kullback-Leibler divergence D(p\\ q) between two densities p and q is 

D(p\\q)= [ p ( x )log^\dx, (4.4) 

Jx 

under the convention that 01og(0/g) = 0, and p\og(p/0) = oo for p ^ 0. The following is 
straightforward to prove. 

Proposition 4.1. Let q* be defined as in (4.3). There exists a forest F* 6 T&, such that 

**=^= n fu x ;X ) R p * [xk) (4 - 5) 

(ij)eE P . P {Xl>P {Xl> kev F * 
where p*(xi,Xj) andp*(xi) are the bivariate and univariate marginal densities of p* . 
For any density q(x), the negative log-likelihood risk R(q) is defined as 

R(q) = -Elogg(X) = - / p*(x) \ogq(x) dx. (4.6) 

Jx 

It is straightforward to see that the density q* defined in (4.3) also minimizes the negative 
log-likelihood loss: 

q* = argminD(p*|| q) = argmini?(g) (4-7) 

qev d ' qer d 

We thus define the oracle risk as R* = R(q*)- Using Proposition 4.1 and equation (4.1), 
we have 

R* = R{ q *) = R(p* F .) 
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nxi-,x j )+ H ( X ^ ( 4 - 8 ) 



(i,j)eE F * fcev, 



F* 



I{X t - X 3 ) = / p*(x t , Xj ) log f \ l \;> d Xi dXi (4.9) 



where 

P ip^ii Xj) 

Ix^ * ~' J ' ^ P*(xt) p*(Xj) 
is the mutual information between the pair of variables Xj, Xj and 

H{X k ) = -f p*(x k )\ogp*(x k )dx k (4.10) 
Jx k 

is the entropy. 

4-1- A Two-Step Procedure 

If the true density p*(x) were known, by Proposition 4.1, the density estimation problem 
would be reduced to finding the best forest structure F|, satisfying 

= argmini?(p^) = argminD(p*|| p* F ). (4-H) 

The optimal forest can be found by minimizing the right hand side of (4.8). Since the 
entropy term H(X) = J2k H(X k ) is constant across all forests, this can be recast as the 
problem of finding the maximum weight spanning forest for a weighted graph, where the 
weight of the edge connecting nodes % and j is I(Xi)Xj). Kruskal's algorithm (Kruskal, 
1956) is a greedy algorithm that is guaranteed to find a maximum weight spanning tree of a 
weighted graph. In the setting of density estimation, this procedure was proposed by Chow 
and Liu (1968) as a way of constructing a tree approximation to a distribution. At each 
stage the algorithm adds an edge connecting that pair of variables with maximum mutual 
information among all pairs not yet visited by the algorithm, if doing so does not form a 
cycle. When stopped early, after k < d — 1 edges have been added, it yields the best fc-edge 
weighted forest. 

Of course, the above procedure is not practical since the true density p*(x) is unknown. We 
replace the population mutual information I(Xi] Xj) in (4.8) by a plug-in estimate I n (Xi; Xj), 
defined as 

T n (X i ;X j )=[ p n (xi, Xj) log y^ Xl l!°^ dxidxj (4.12) 

J Xi x Xj Pn{Xi) Pn [Xj ) 

where p n (xi,Xj) and p n {xi) are bivariate and univariate kernel density estimates. Given 



this estimated mutual information matrix M n 



I n {Xi] Xj 



, we can then apply Kruskal's 



algorithm (equivalently, the Chow-Liu algorithm) to find the best tree structure F n . 
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Since the number of edges of F n controls the number of degrees of freedom in the final 
density estimator, an automatic data-dependent way to choose it is needed. We adopt the 
following two-stage procedure. First, we randomly split the data into two sets V>\ and T>2 of 
sizes rii and n 2 \ we then apply the following steps: 

1. Using T>x, construct kernel density estimates of the univariate and bivariate marginals 
and calculate I ni (Xi\ Xj) for i,j G {!,...,<£} with i ^ j. Construct a full tree Fn v ^ 
with d — 1 edges, using the Chow-Liu algorithm. 

2. Using V 2 , prune the tree F^' 1 " 1 to find a forest F$ with k edges, for < k < d — 1. 

Once Fnx is obtained in Step 2, we can calculate p~<%) according to (4.1), using the kernel 
density estimates constructed in Step 1. 



4-1.1. Step 1: Constructing a sequence of forests 

Step 1 is carried out on the dataset T>\. Let K(-) be a univariate kernel function. Given 
an evaluation point (xi,Xj), the bivariate kernel density estimate for (Xi,Xj) based on the 
observations {X^ s \ Xj } s e©i is defined as 

, M = (*£z*) K (3^) , (413) 

where we use a product kernel with h 2 > as the bandwidth parameter. The univariate 
kernel density estimate p ni (x^) for X^ is 




where h\ > is the univariate bandwidth. 

We assume that the data lie in a rf-dimensional unit cube X = [0, l] d . To calculate the 
empirical mutual information I ni (Xf, Xj), we need to numerically evaluate a two-dimensional 
integral. To do so, we calculate the kernel density estimates on a grid of points. We choose m 
evaluation points on each dimension, xu < x 2 i < ■ ■ ■ < x m i for the ith variable. The mutual 
information / ni (JQ; Xj) is then approximated as 

The approximation error can be made arbitrarily small by choosing m sufficiently large. 
As a practical concern, care needs to be taken that the factors p ni (xki) and p ni (xgj) in the 
denominator are not too small; a truncation procedure can be used to ensure this. Once 
the d x d mutual information matrix M ni = J ni (JQ; Xj) is obtained, we can apply the 
Chow-Liu (Kruskal) algorithm to find a maximum weight spanning tree. 
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Tree Construction (Kruskal/Chow-Liu) 



Input: Data set T>\ and the bandwidths hi, hi- 

Initialize: Calculate M ni , according to (4.13), (4.14), and (4.15). 

Set = 

For k = l,...,d- 1: 

1. Set (i^ k \j^) argmax^ -\ M ni (i,j) such that U {(i^- k \ j^)} does not contain a cycle; 

2. <^ E( k -V\J{(iW,jW)}. 

Output: tree F^ -1 ^ with edge set E^ d ~ l \ 
4-1-2. Step 2: Selecting a forest size 

The full tree F^f obtained in Step 1 might have high variance when the dimension d is 
large, leading to overfitting in the density estimate. In order to reduce the variance, we prune 
the tree; that is, we choose an unconnected tree with k edges. The number of edges k is a 
tuning parameter that induces a bias- variance tradeoff. 

In order to choose k, note that in stage k of the Chow-Liu algorithm we have an edge 
set (in the notation of the Algorithm 4.1.1) which corresponds to a forest F$ with k 
edges, where F$ is the union of d disconnected nodes. To select k, we cross-validate over 



Let p n2 (xi,Xj) and p n2 ( x k) be defined as in (4.13) and (4.14), but now evaluated solely 
based on the held-out data in T> 2 - For a density pp that is supported by a forest F, we define 
the held-out negative log-likelihood risk as 



the d forests f£>, F&\ ...,F$- 




p(Xi)p(Xj) 



p(Xi, Xj) 




p n2 (x k ) log p(x k )dx k 



(4.16) 




(4.17) 



and where puk) is computed using the density estimate p ni constructed on T>i. 



We can also estimate k as 




J] p ni (X^) (4.18) 



(4.19) 
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This minimization can be efficiently carried out by iterating over the d — 1 edges in Fnf ^ ■ 
Once k is obtained, the final forest-based kernel density estimate is given by 

n /z [ ?~ x ? ^ n^w- ( 4 - 2 °) 

Another alternative is to compute a maximum weight spanning forest, using Kruskal's 
algorithm, but with heldout edge weights 

{^(ij) = -y log P^( x y x /] (421) 

In fact, asymptotically (as n 2 —> oo) this gives optimal tree-based estimator constructed in 
terms of the kernel density estimates p ni . 

4-2. Statistical Properties 

The statistical properties of forest density estimator can be analyzed under the same type 
of assumptions that are made for classical kernel density estimation. In particular, assume 
that the univariate and bivariate densities lie in a Holder class with exponent (5. Under 
this assumption the minimax rate of convergence in the squared error loss is 0(n p/ W +1) ) for 
bivariate densities and 0(n 2/3//< - 2/3+1 - ) ) for univariate densities. Technical assumptions on the 
kernel yield L ro concentration results on kernel density estimation (Gine and Guillou, 2002). 

Choose the bandwidths h\ and h 2 to be used in the one-dimensional and two-dimensional 
kernel density estimates according to 



logn 



n 



1+2/3 



h x x (4.22) 



,'loerA 2 + 2 ' 3 

/ia * — J • ( 4 - 23 ) 



n 

(k) 

This choice of bandwidths ensures the optimal rate of convergence. Let V d be the family 
of d- dimensional densities that are supported by forests with at most k edges. Then 

Vf ] CV^ ] C ■■■ CV ( /- 1] . (4.24) 

Due to this nesting property, 

inf R(q F ) > inf R(q F ) > ■ ■ > inf R(q F ). (4.25) 

This means that a full spanning tree would generally be selected if we had access to the true 
distribution. However, with access to finite data to estimate the densities {p ni ) the optimal 
procedure is to use fewer than d — 1 edges. The following result analyzes the excess risk 
resulting from selecting the forest based on the heldout risk R n2 . 
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Theorem 4.1. Letp^k) be the estimate with \E~(k) \ = k obtained after the first k iterations 
of the Chow-Liu algorithm. Then under (omitted) technical assumptions on the densities and 
kernel, for any 1 < k < d — 1, 

R{ Pf^)~ mf } R(q F ) = P (k\J XOg ™^ d + d ^^dm j ^ 

and 

~ Md-i R{ ^ k)) = 0p ( (r + ^V^fS^ +d V^S^) (4 - 27) 

where k = argmin 0<fc<d _ 1 # n2 (pW fe) ) and k* = argmin 0<fc<d _ 1 R{p^ k) ). 

- - r d - - r d 

The main work in proving this result lies in establishing bounds such as 

sup \R(p F )-R n2 (p F )\ = P (<j) n (k) + Md)) (4.28) 

F^ d k) 

where R n2 is the heldout risk, under the 

<Pn{k) = 

For the proof of this and related results, 

R{p P (k) ) - R{pp(.k*)) = Riv^k) ) - R n2 (pWs) ) + R n2 {PzCk) ) - R(Pp(*n) (4.31) 

r d d r d r d r d d 

= P {(j> n (k) + ip n (d)) + Rn 2 {p^)) - R(Pp^) ( 4 - 32 ) 

^ d d 

< P (<f) n (k) + Vn(<0) + Rn 2 {p P ( k *)) - R(Pp(^) (4.33) 

d d 

= Op(Mk) + Mk*)+Md)) • (4-34) 

where (4.33) follows from the fact that k is the minimizer of R n2 (-). This result allows the 
dimension d to increase at a rate o y n 2/3 ^ 1+2 ^ / log nj and the number of edges k to increase 

at a rate o ^ a/ nfil ( 1 +/ 3 ) / log nj , with the excess risk still decreasing to zero asymptotically. 

Note that the minimax rate for 2-dimensional kernel density estimation under our stated 
conditions is n~^^ l3+1 \ The rate above is essentially the square root of this rate, up to 
logarithmic factors. This is because a higher order kernel is used, which may result in negative 
values. Once we correct these negative values, the resulting estimated density will no longer 
integrate to one. The slower rate is due to a very simple truncation technique to correct 



notation 



N (4-29) 



'lOj 


g n + log 


;d 


n /3/(/3+l) 


'loj 


I n + log 


d 



d y • ( 4 - 3 °) 

see Liu et al. (2011). Using this, one easily obtains 
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Arabidopsis thaliana is a small flower- 
ing plant; it was the first plant genome 
to be sequenced, and its roughly 27,000 
genes and 35,000 proteins have been ac- 
tively studied. Here we consider a data set 
based on Affymetrix GeneChip microar- 
rays with sample size n = 118, for which 
d — 40 genes have been selected for anal- 
ysis. 



source: wikipedia.org 



the higher-order kernel density estimator to estimate mutual information. Current work is 
investigating a different version of the higher order kernel density estimator with more careful 
correction techniques, for which it is possible to achieve the optimal minimax rate. 

In theory the bandwidths are chosen as in (4.22) and (4.23), assuming is known. In 
our experiments presented below, the bandwidth h k for the 2-dimensional kernel density 
estimator is chosen as 

h k = 1.06 • min |g fc , ^-to | _ n -i/( 2/ 3 + 2) ) (435) 

where a k is the sample standard deviation of {-X^ }«ex>i an d 9fc,o.75, 9fc,o.25 are the 75% and 
25% sample quantiles of {X^} seVll with = 2. 

5. Examples 

5.1. Gene- Gene Interaction Graphs 

The nonparanormal and Gaussian graphical model can construct very different graphs. Here 
we consider a data set based on Affymetrix GeneChip microarrays for the plant Arabidopsis 
thaliana, (Wille et al., 2004). The sample size is n = 118. The expression levels for each chip 
are pre-processed by log-transformation and standardization. A subset of 40 genes from the 
isoprenoid pathway is chosen for analysis. 

While these data are often treated as multivariate Gaussian, the nonparanormal and the 
glasso give very different graphs over a wide range of regularization parameters, suggesting 
that the nonparametric method could lead to different biological conclusions. 

The regularization paths of the two methods are compared in Figure 4. To generate 
the paths, we select 50 regularization parameters on an evenly spaced grid in the inter- 



17 



glasso path 



nonparanormal path 




val [0.16, 1.2]. Although the paths for the two methods look similar, there are some subtle 
differences. In particular, variables become nonzero in a different order. 

Figure 5 compares the estimated graphs for the two methods at several values of the 
regularization parameter A in the range [0.16,0.37]. For each A, we show the estimated 
graph from the nonparanormal in the first column. In the second column we show the graph 
obtained by scanning the full regularization path of the glasso fit and finding the graph 
having the smallest symmetric difference with the nonparanormal graph. The symmetric 
difference graph is shown in the third column. The closest glasso fit is different, with edges 
selected by the glasso not selected by the nonparanormal, and vice-versa. The estimated 
transformation functions for several genes are shown Figure 6, which show non-Gaussian 
behavior. 

Since the graphical lasso typically results in a large parameter bias as a consequence of 
the l\ regularization, it sometimes make sense to use the refit glasso, which is a two-step 
procedure — in the first step, a sparse inverse covariance matrix is obtained by the graphical 
lasso; in the second step, a Gaussian model is refit without l\ regularization, but enforcing 
the sparsity pattern obtained in the first step. 

Figure 7 compares forest density estimation to the graphical lasso and refit glasso. It can 
be seen that the forest-based kernel density estimator has better generalization performance. 
This is not surprising, given that the true distribution of the data is not Gaussian. (Note that 
since we do not directly compute the marginal univariate densities in the nonparanormal, we 
are unable to compute likelihoods under this model.) The held-out log-likelihood curve for 
forest density estimation achieves a maximum when there are only 35 edges in the model. 
In contrast, the held-out log-likelihood curves of the glasso and refit glasso achieve maxima 
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Fig 5. The nonparanormal estimated graph for three values of X — 0.2448,0.2661,0.30857 (left column), the 
closest glasso estimated graph from the full path (middle) and the symmetric difference graph (right). 




FlG 6. Estimated transformation functions for four genes in the microarray data set, indicating non-Gaussian 
marginals. The corresponding genes are among the nodes appearing in the symmetric difference graphs above. 
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Fig 7. Results on microarray data. Top: held-out log-likelihood of the forest density estimator (black 
step function), glasso (red stars), and refit glasso (blue circles). Bottom: estimated graphs using the 
forest-based estimator (left) and the glasso (right), using the same node layout. 
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when there are around 280 edges and 100 edges respectively, while their predictive estimates 
are still inferior to those of the forest-based kernel density estimator. Figure 7 also shows the 
estimated graphs for the forest-based kernel density estimator and the graphical lasso. The 
graphs are automatically selected based on held-out log-likelihood, and are clearly different. 

5.2. Graphs for Equities Data 

For the examples in this section we collected stock price data from Yahoo! Finance (finance . 
yahoo.com). The daily closing prices were obtained for 452 stocks that consistently were in 
the S&P 500 index between January 1, 2003 through January 1, 2011. This gave us altogether 
2,015 data points, each data point corresponds to the vector of closing prices on a trading 
day. With S t j denoting the closing price of stock j on day t, we consider the variables 
X t j = log (Sij/Si-ij) and build graphs over the indices j. We simply treat the instances 
X t as independent replicates, even though they form a time series. The data contain many 
outliers; the reasons for these outliers include splits in a stock, which increases the number of 
shares. We Winsorize (or truncate) every stock so that its data points are within three times 
the mean absolute deviation from the sample average. The importance of this Winsorization 
is shown below (see the "snake graph" in Figure 9). For the following results we use the 
subset of the data between January 1, 2003 to January 1, 2008, before the onset of the 
"financial crisis." It is interesting to compare to results that include data after 2008, but we 
omit these for brevity. 

The 452 stocks are categorized into 10 Global Industry Classification Standard (GICS) 
sectors, including Consumer Discretionary (70 stocks), Consumer Staples (35 stocks), 
Energy (37 stocks), Financials (74 stocks), Health Care (46 stocks), Industrials (59 
stocks), Information Technology (64 stocks), Materials (29 stocks), Telecommunications 
Services (6 stocks), and Utilities (32 stocks). In the graphs shown below, the nodes are 
colored according to the GICS sector of the corresponding stock. It is expected that stocks 
from the same GICS sectors should tend to be clustered together, since stocks from the same 
GICS sector tend to interact more with each other. This is indeed this case; for example, 
Figure 8 shows examples of the neighbors of two stocks, Yahoo Inc. and Target Corp., in the 
forest density graph. 

Figures 9(a)-(c) show graphs estimated using the glasso, nonparanormal, and forest den- 
sity estimator on the data from January 1, 2003 to January 1, 2008. There are altogether 
n = 1,257 data points and d = 452 dimensions. To estimate the glasso graph, we somewhat 
arbitrarily set the regularization parameter to A = .55, which results a graph that has 1,316 
edges, about 3 neighbors per node, and good clustering structure. The resulting graph is 
shown in Figure 9(a). The corresponding nonparanormal graph is shown in Figure 9(b). The 
regularization is chosen so that it too has 1,316 edges. Only nodes that have neighbors in 
one of the graphs are shown; the remaining nodes are disconnected. 

Since our dataset contains n = 1,257 data points, we directly apply the forest density 
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Fig 8. Example neighborhoods in a forest graph for two stocks, Yahoo Inc. and Target Corp. The correspond- 
ing GICS industries are shown in parentheses. ('Consumer Discr. is short for Consumer Discretionary, 
and Information Tech. is short for Information Technology .) 

estimator on the whole dataset to obtain a full spanning tree of d — 1 = 451 edges. This 
estimator turns out to be very sensitive to outliers, since it exploits kernel density estimates 
as building blocks. In Figure 9(d) we show the estimated forest density graph on the stock 
data when outliers are not trimmed by Winsorization. In this case the graph is anomolous, 
with a snake-like character that weaves in and out of the 10 GICS industries. Intuitively, 
the outliers make the two-dimensional densities appear like thin "pancakes," and densities 
with similar orientations are clustered together. To address this, we trim the outliers by 
Winsorizing at 3 MADs, as described above. Figure 9(c) shows the estimated forest graph, 
restricted to the same stocks shown for the graphs in (a) and (b). The resulting graph has 
good clustering with respect to the GICS sectors. 

Figures 10(a)-(c) display the differences and edges common to the glasso, nonparanormal, 
and forest graphs. Figure 10(a) shows the symmetric difference between the estimated glasso 
and nonparanormal graphs, and Figure 10(b) shows the common edges. Figure 10(c) shows 
the symmetric difference between the nonparanormal and forest graphs, and Figure 10(d) 
shows the common edges. 

We refrain from drawing any hard conclusions about the effectiveness of the different 
methods based on these plots — how these graphs are used will depend on the application. 
These results serve mainly to highlight how very different inferences about the independence 
relations can arise from moving from a Gaussian model to a semiparametric model to a fully 
nonparametric model with restricted graphs. 

6. Related Work 

There is surprisingly little work on structure learning of nonparametric graphical models in 
high dimensions. One piece of related work is sparse log-density smoothing spline ANOVA 
models, introduced by Jeon and Lin (2006). In such a model the log-density function is 
decomposed as the sum of a constant term, one dimensional functions (main effects), two- 
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Fig 9. Graphs build on S&P 500 stock data from Jan. 1, 2003 to Jan. 1, 2008. The graphs are 
estimated using (a) the glasso, (b) the nonparanormal, and (c) forest density estimation. The nodes 
are colored according to their GICS sector categories. Nodes are not shown that have zero neighbors 
in both the glasso and nonparanormal graphs. Figure (d) shows the maximum weight spanning tree 
that results if the data are not Winsorized to trim outliers. 
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(a) difference between glasso and nonparanormal (b) edges common to giasso and nonparanormal 




(c) difference between nonparanormal and FDE (d) edges common to nonparanormal and FDE 



Fig 10. Visualizations of the differences and similarities between the estimated graphs. The symmet- 
ric difference between the glasso and nonparanormal graphs is shown in (a), and the edges common 
to the graphs are shown in (b). Similarly, the symmetric difference between the nonparanormal and 
forest density estimate is shown in (c), and the common edges are shown in (d). 
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dimensional functions (two-way interactions), and so on: 



\ogp(x) = f(x) ^c + ^fjixj) + ^2f jk (xj,x k ) + ■■■ . (6.1) 

3=1 j<k 

The component functions satisfy certain constraints so that the model is identifiable. In high 
dimensions, the model is truncated up to second order interactions so that the computation 
is still tractable. There is a close connection between the log-density ANOVA model and 
undirected graphical models. For a model with only main effects and two-way interactions, 
we define a graph G = (V, E) such that G E if and only if ^ 0. It can be seen that 
p(x) is Markov to G. Jeon and Lin (2006) assume that these component functions belong 
to certain reproducing kernel Hilbert spaces (RKHSs) equipped with a RKHS norm || ■ ||#. 
To obtain a sparse estimation of the component functions f(x), they propose a penalized 
M-estimator 

1 n f 
/ = argmax{-^exp(/(X«)) + / f(x)p(x)dx + \J(f)}, (6.2) 

f i=l J 

where p(x) is some pre-defined positive density and J(f) is a sparsity-inducing penalty that 
takes the form 

d 

3=1 j<k 

Solving (6.2) only requires one-dimensional integrals which can be efficiently computed. 
However, the optimization in (6.2) exploits a surrogate loss instead of the log-likelihood loss, 
and is more difficult to analyze theoretically. 

Another related idea is to conduct structure learning using nonparametric decomposable 
graphical models (Schwaighofer et al., 2007). A distribution is a decomposable graphical 
model if it is Markov to a graph G = (V, E) which has a junction tree representation, 
which can be viewed as an extension of tree-based graphical models. A junction tree yields 
a factorized form 

where Vr denotes the set of cliques in V and E T is the set of separators, i.e., the intersection of 
two neighboring cliques in the junction tree. Exact search for the junction tree structure that 
maximizes the likelihood is usually computationally expensive. Schwaighofer et al. (2007) 
propose a forward-backward strategy for nonparametric structure learning. However, such a 
greedy procedure does not guarantee that the global optimal solution is found, and makes 
theoretical analysis challenging. 
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7. Discussion 



This paper has considered undirected graphical models for continuous data, where the general 
densities take the form 



Such a general family is at least as difficult as the general high-dimensional nonparametric 
regression model. But, as for regression, simplifying assumptions can lead to tractable and 
useful models. We have considered two approaches that make very different tradeoffs between 
statistical generality and computational efficiency The nonparanormal relies on estimating 
one-dimensional functions, in a manner that is similar to the way additive models estimate 
one-dimensional regression functions. This allows arbitrary graphs, but the distribution is 
semiparametric, via the Gaussian copula. At the other extreme, when we restrict to acyclic 
graphs we can have fully nonparametric bivariate and univariate marginals. This leverages 
classical techniques for low- dimensional density estimation, together with approximation 
algorithms for constructing the graph. Clearly these are just two among many possibilities 
for nonparametric graphical modeling. We conclude, then, with a brief description of a few 
potential directions for future work. 

As we saw with the nonparanormal, if only the graph is of interest, it may not be im- 
portant to estimate the functions accurately. More generally, to estimate the graph it is not 
necessary to estimate the density. One of the most effective and theoretically well-supported 
methods for estimating Gaussian graphs is due to Meinshausen and Buhlmann (2006). In 
this approach, we regress each variable Xj onto all other variables (Xk)k^j using the lasso. 
This directly estimates the set of neighbors M(J) — {k | (j, k) G E} for each node j in the 
graph, but the covariance matrix is not directly estimated. Lasso theory gives conditions and 
guarantees on these variable selection problems. This approach was adapted to the discrete 
case by Ravikumar et al. (2010), where the normalizing constant and thus the density can't 
be efficiently computed. This general strategy may be attractive for graph selection in non- 
parametric graphical models. In particular, each variable could be regressed on the others 
using a nonparametric regression method that performs variable selection; one such method 
with theoretical guarantees is due to Lafferty and Wasserman (2008). 

A different framework for nonparametricity involves conditioning on a collection of ob- 
served explanatory variables Z. Liu et al. (2010) develop a nonparametric procedure called 
Graph- optimized CART, or Go-CART, to estimate the graph conditionally under a Gaus- 
sian model. The main idea is to build a tree partition on the Z space just as in CART 
(classification and regression trees), but to estimate a graph at each leaf using the glasso. 
Oracle inequalities on risk minimization and model selection consistency were established 
for Go-CART by Liu et al. (2010). When Z is time, graph- valued regression reduces to the 
time-varying graph estimation problem (Chen et al., 2010; Kolar et al., 2009; Zhou et al., 





2010). 
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Another fruitful direction is the introduction of latent variables. Even though the graph- 
ical model of the observed variables X may be complex, when conditioned on some latent 
explanatory variables Z, the graph may be simplified. One straightforward approach is to 
build mixtures of the models we consider here. A mixture of nonparanormals will require 
new methods, to compute the derivatives f'Axj). A mixture of forests could be implemented 
using a kind of nonparametric EM algorithm, with kernel density estimates over weighted 
data in the M-step. But it is not easy to read off a graph from a mixture model. 

In parametric settings, Chandrasekaran et al. (2010) and Choi et al. (2010) develop al- 
gorithms and theory for learning graphical models with latent variables. The first paper 
assumes the joint distribution of the observed and latent variables is a Gaussian graphical 
model, and the second paper assumes the joint distribution is discrete and factors according 
to a forest. Since the nonparanormal and forest density estimator are nonparametric versions 
of the Gaussian and forest graphical models for discrete data, we expect similar techniques to 
those of Chandrasekaran et al. (2010); Choi et al. (2010) can be used to extend our methods 
to handle latent variables. It would also be of interest to formulate nonparametric extensions 
of low rank plus sparse covariance matrices. 

No matter how the methodology develops, nonparametric graphical models will at best 
be approximations to the true distribution in many applications. Yet, there is plenty of 
experience to show how incorrect models can be useful. An ongoing challenge in nonpara- 
metric graphical modeling will be to better understand how the structure can be accurately 
estimated even when the model is wrong. 
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